Replace hardcoded init.krun with generic virtual file overlay #673

Draft
mtjhrc wants to merge 14 commits into containers:main from mtjhrc:virtual-inodes-v1

@mtjhrc
Collaborator

@mtjhrc mtjhrc commented May 11, 2026

This PR replaces the hardcoded init.krun handling in the virtiofs passthrough backends with a generic virtual-files overlay (AugmentFs).

This introduces 2 new filesystem trait implementations:

  • AugmentFs<T>, a wrapper that intercepts FUSE operations for virtual inodes: synthetic read-only files and directories backed by static data. It also handles our custom ioctls.
  • NullFs, a minimal FileSystem impl with just an empty root directory — used when no host directory is needed

The init.krun is registered as just a virtual file from the API layer. As a bonus you can even inject the .krun_config.json as a virtual file.
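
To illustrate the layering, here is a minimal sketch of an overlay that answers lookups for registered virtual files and delegates everything else to an inner filesystem. SimpleFs, EmptyFs, and Overlay are hypothetical stand-ins for the FileSystem trait, NullFs, and AugmentFs<T>; the real trait covers the full set of FUSE operations with inodes and handles, which this sketch leaves out.

```rust
use std::collections::HashMap;

/// Hypothetical, heavily simplified stand-in for the real FileSystem trait.
trait SimpleFs {
    /// Return the contents of `name` in the root directory, if it exists.
    fn lookup(&self, name: &str) -> Option<Vec<u8>>;
}

/// Inner filesystem with an empty root (the NullFs role): every lookup misses.
struct EmptyFs;

impl SimpleFs for EmptyFs {
    fn lookup(&self, _name: &str) -> Option<Vec<u8>> {
        None
    }
}

/// Overlay (the AugmentFs role): serves virtual files, delegates the rest.
struct Overlay<T: SimpleFs> {
    inner: T,
    virtual_files: HashMap<String, &'static [u8]>,
}

impl<T: SimpleFs> Overlay<T> {
    fn new(inner: T) -> Self {
        Overlay { inner, virtual_files: HashMap::new() }
    }

    /// Register a read-only file backed by static data.
    fn add_file(&mut self, name: &str, data: &'static [u8]) {
        self.virtual_files.insert(name.to_string(), data);
    }
}

impl<T: SimpleFs> SimpleFs for Overlay<T> {
    fn lookup(&self, name: &str) -> Option<Vec<u8>> {
        match self.virtual_files.get(name) {
            // Virtual entries are checked first.
            Some(data) => Some(data.to_vec()),
            // Anything else goes to the wrapped filesystem.
            None => self.inner.lookup(name),
        }
    }
}
```

Because the overlay is generic over the inner filesystem, the same wrapper can sit on top of a passthrough backend or an empty root without either knowing about init.krun.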

krun_set_root_disk_remount() is reimplemented on top of NullFs + AugmentFs (see #551 (comment)).

The public API is still mostly compatible. There are minor differences; for example, init.krun disappears after it has been looked up once.

API-breaking changes: making krun_disable_implicit_init() and the other disable_implicit_* behaviors the default is left to a follow-up PR.

The init binary is now in its own init-blob crate. The direction for #634 (2.0 API) is to invert the dependency: init-blob would depend on libkrun's overlay APIs to inject itself, rather than libkrun depending on a specific init.

This supersedes #593 by @ggoodman, which tackled the same problem of decoupling init from the fs backends. This PR takes that idea further by removing awareness of init from the filesystem layer entirely - it's just another virtual file. #593 also introduced InitPolicy startup validation - how that fits into the 2.0 API (#634) with different payload types is still an open question.

Known limitations / future work:

  • Virtual inodes don't appear in readdir (pre-existing — init.krun was also lookup-only)
  • The EXPORT_FD ioctl (GPU cross-domain shared memory) remains in passthrough for now
  • No DAX (setupmapping) for virtual files on macOS (pre-existing — init.krun never had DAX on macOS either)
  • DAX setupmapping/removemapping layering: the overlay creates mappings but the inner passthrough tears them down (works correctly but is architecturally messy)

@ggoodman
Contributor

No comments on the code but I really love the direction!

@jakecorrenti
Member

@mtjhrc do you want this to merge before #670 or after?

@mtjhrc mtjhrc force-pushed the virtual-inodes-v1 branch 3 times, most recently from 95eac11 to 429bfe3, May 13, 2026 11:47
mtjhrc added 14 commits May 13, 2026 13:48
Move the init binary build script and include_bytes!() from the
devices crate into a new init-blob crate. The passthrough modules
reference the binary as init_blob::INIT_BINARY instead of using
include_bytes! directly.

Inspired by containers#593 by Geoffrey Goodman <geoff@goodman.dev>.
Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
Replace the private next_inode AtomicU64 inside PassthroughFs with a
shared InodeAllocator that is passed in at construction. This lets
multiple layers (e.g. a future virtual-inode overlay) allocate from
the same counter without implicit coordination via reserved ranges.

The allocator starts at ROOT_ID + 2, reserving inode 2 for the
existing init_inode in PassthroughFs. This reservation is removed
in the next commit when init handling moves to AugmentFs.

PassthroughFs::new() and PassthroughFsRo::new() now take an
Arc<InodeAllocator> parameter. FsWorker::new() creates the allocator
and passes it through.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
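
A minimal sketch of the shared-allocator idea from the commit above, assuming the allocator is a thin wrapper around an AtomicU64. The Layer struct and the method names are hypothetical, and ROOT_ID = 1 is an assumption based on the usual FUSE root inode; the real InodeAllocator may look different.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Assumed value: FUSE conventionally uses inode 1 for the root.
const ROOT_ID: u64 = 1;

/// Hands out unique inode numbers to every layer holding a clone of the Arc.
struct InodeAllocator {
    next: AtomicU64,
}

impl InodeAllocator {
    /// `first` is the first inode number that will be handed out.
    fn new(first: u64) -> Self {
        InodeAllocator { next: AtomicU64::new(first) }
    }

    /// Return a fresh inode number; safe to call from several threads.
    fn allocate(&self) -> u64 {
        self.next.fetch_add(1, Ordering::Relaxed)
    }
}

/// Hypothetical filesystem layer that draws inodes from the shared counter.
struct Layer {
    alloc: Arc<InodeAllocator>,
}
```

Starting the counter at ROOT_ID + 2 leaves inode 2 unallocated, which matches the reservation described in the commit message. Two layers that share the Arc can never hand out the same number, so no reserved ranges are needed.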
Introduce AugmentFs<T>, a generic overlay that wraps any FileSystem
implementation and intercepts FUSE operations for virtual inodes —
synthetic read-only files and directories backed by static data.
One-shot files can only be looked up once.

Remove all init.krun special-case code (init_inode, init_handle,
INIT_CSTR) from both the Linux and macOS passthrough implementations.
The init.krun virtual file is now configured via VirtualDirEntry in
the krun API layer and handled generically by the overlay.

FsDeviceConfig carries a Vec<VirtualDirEntry> and FsWorker wraps
AugmentFs<PassthroughFs> / AugmentFs<PassthroughFsRo>.

The InodeAllocator now starts at ROOT_ID + 1 since the init_inode
reservation is no longer needed.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
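
The one-shot behavior from the commit above can be sketched as a name table that drops an entry on its first successful lookup. VirtualTable and its methods are hypothetical names; the real overlay works on FUSE inodes and presumably keeps an already looked-up file readable, which this sketch does not model.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// A read-only virtual file backed by static data.
struct VirtualFile {
    data: &'static [u8],
    one_shot: bool,
}

/// Hypothetical name table for the virtual entries of one directory.
struct VirtualTable {
    files: Mutex<HashMap<String, VirtualFile>>,
}

impl VirtualTable {
    fn new() -> Self {
        VirtualTable { files: Mutex::new(HashMap::new()) }
    }

    fn add(&self, name: &str, data: &'static [u8], one_shot: bool) {
        let mut files = self.files.lock().unwrap();
        files.insert(name.to_string(), VirtualFile { data, one_shot });
    }

    /// Look up `name`. A one-shot entry is removed on the first hit,
    /// so a second lookup of the same name misses.
    fn lookup(&self, name: &str) -> Option<&'static [u8]> {
        let mut files = self.files.lock().unwrap();
        let one_shot = files.get(name)?.one_shot;
        if one_shot {
            files.remove(name).map(|f| f.data)
        } else {
            files.get(name).map(|f| f.data)
        }
    }
}
```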
Add API to prevent the default init binary (/init.krun) from being
injected into the root filesystem. Follows the existing
krun_disable_implicit_{console,vsock} pattern.

Must be called before krun_set_root().

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
Add C APIs to inject virtual files and directories into a virtiofs
device. Entries are backed entirely by host memory (no host file).
Files support one-shot semantics (disappear after the first lookup).

Paths may contain '/' to nest entries inside existing virtual
directories (e.g. krun_fs_add_overlay_dir for "etc", then
krun_fs_add_overlay_file for "etc/hostname"). Intermediate
directories must already exist; -ENOENT / -ENOTDIR is returned
otherwise.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
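
The nested-path rule from the commit above can be sketched as a tree insert that refuses to create intermediate directories. This is an illustration in Rust rather than the C API itself: Node and insert are hypothetical names, the errno values are hard-coded Linux numbers, and the -EEXIST and -EINVAL cases are assumptions that the commit message does not cover.

```rust
use std::collections::BTreeMap;

// Linux errno values, hard-coded to keep the sketch dependency-free.
const ENOENT: i32 = 2;
const EEXIST: i32 = 17; // assumption: duplicates are rejected
const ENOTDIR: i32 = 20;
const EINVAL: i32 = 22; // assumption: empty path components are rejected

enum Node {
    File(Vec<u8>),
    Dir(BTreeMap<String, Node>),
}

/// Insert `node` at `path` below `root`.
/// Intermediate directories must already exist; they are never created here.
fn insert(root: &mut BTreeMap<String, Node>, path: &str, node: Node) -> Result<(), i32> {
    let mut parts: Vec<&str> = path.split('/').collect();
    if parts.iter().any(|p| p.is_empty()) {
        return Err(-EINVAL);
    }
    let leaf = parts.pop().ok_or(-EINVAL)?;
    let mut dir = root;
    for part in parts {
        dir = match dir.get_mut(part) {
            Some(Node::Dir(children)) => children,
            // A file sits where a directory is required.
            Some(Node::File(_)) => return Err(-ENOTDIR),
            // The intermediate directory was never added.
            None => return Err(-ENOENT),
        };
    }
    if dir.contains_key(leaf) {
        return Err(-EEXIST);
    }
    dir.insert(leaf.to_string(), node);
    Ok(())
}
```

With this shape, adding "etc/hostname" before "etc" fails with -ENOENT, and adding anything below "etc/hostname" fails with -ENOTDIR, which mirrors the ordering requirement on krun_fs_add_overlay_dir and krun_fs_add_overlay_file.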
Add API to retrieve the built-in default init binary. Callers that
use krun_disable_implicit_init() can use this to obtain the init
binary and inject it themselves via krun_fs_add_overlay_file().

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
NullFs implements the FileSystem trait with just an empty root
directory. It can be wrapped with AugmentFs to serve virtual
files without any host directory involvement.

Fs::new() now accepts Option<String> for shared_dir — None selects
NullFs. FsDeviceConfig and FsServer gain the corresponding variants.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
krun_set_root_disk_remount no longer creates a temporary empty host
directory. Instead it configures a NullFs-backed virtiofs device
(shared_dir: None) with init.krun overlaid via AugmentFs.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
The temporary root directory hack is gone (replaced by NullFs), so
the ioctl that cleaned it up and the config flag that gated it are
no longer needed. Remove allow_root_dir_delete from FsDeviceConfig,
Fs::new(), passthrough Config, and all call sites.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
The exit-code ioctl is a krun mechanism, not a filesystem operation.
Move it to the AugmentFs overlay where it is handled before any
delegation to the inner filesystem.

The Linux passthrough retains only EXPORT_FD (which needs access to
passthrough-internal handle and export tables). The macOS passthrough
no longer implements ioctl at all.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
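
The dispatch order described in the commit above (overlay first, inner filesystem second) can be sketched as follows. Everything here is hypothetical: the IoctlFs trait, the CMD_SET_EXIT_CODE number, and the -ENOTTY result for a backend without ioctl support are assumptions for illustration, not the real ioctl encoding.

```rust
use std::sync::atomic::{AtomicI32, Ordering};

/// Linux errno for an unsupported ioctl (assumed result for a backend without ioctls).
const ENOTTY: i32 = 25;
/// Hypothetical command number for the exit-code ioctl.
const CMD_SET_EXIT_CODE: u32 = 1;

/// Hypothetical, simplified ioctl entry point.
trait IoctlFs {
    fn ioctl(&self, cmd: u32, arg: i32) -> Result<(), i32>;
}

/// Inner filesystem that implements no ioctls at all.
struct NoIoctlFs;

impl IoctlFs for NoIoctlFs {
    fn ioctl(&self, _cmd: u32, _arg: i32) -> Result<(), i32> {
        Err(-ENOTTY)
    }
}

/// Overlay that owns the exit-code mechanism.
struct IoctlOverlay<T: IoctlFs> {
    inner: T,
    exit_code: AtomicI32,
}

impl<T: IoctlFs> IoctlFs for IoctlOverlay<T> {
    fn ioctl(&self, cmd: u32, arg: i32) -> Result<(), i32> {
        if cmd == CMD_SET_EXIT_CODE {
            // Handled here, before any delegation to the inner filesystem.
            self.exit_code.store(arg, Ordering::Relaxed);
            return Ok(());
        }
        // Backend-specific commands (EXPORT_FD on Linux) fall through.
        self.inner.ioctl(cmd, arg)
    }
}
```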
Boot a VM with a pure NullFs root — no host directory at all. Every
file in the root (init.krun, guest-agent, .krun_config.json, test
data) is injected as a virtual overlay, and /dev, /proc, /sys are
virtual empty directories used as mount points.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
Boot from an ext4 block device via krun_set_root_disk_remount. The
virtiofs root uses NullFs with init.krun and virtual mount-point
directories overlaid. The guest verifies it pivoted to the block
device root successfully.

Uses dlsym for krun_add_disk/krun_set_root_disk_remount so the test
compiles without BLK and skips gracefully at runtime.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
Build and test with the block device feature so the root-disk-remount
test runs in CI. Install e2fsprogs (provides mke2fs) which the test
needs to create the ext4 disk image.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
@mtjhrc mtjhrc force-pushed the virtual-inodes-v1 branch from 429bfe3 to 4991f9b, May 13, 2026 11:54
@ggoodman
Contributor

As a thought, while you're implementing this, @mtjhrc, is there a reasonable opportunity to generalize this to supporting arbitrary virtual files (at least in a ro capacity)?

I actually have a use case where I want a procfs-like filesystem provided by the host. The contents of this filesystem would all be programmatically backed. Making it r/w sounds like a whole other beast, but I wanted to seed the idea just in case it could become a logical extension of what you're already hacking on.

@mtjhrc
Collaborator Author

mtjhrc commented May 13, 2026

supporting arbitrary virtual files (at least in a ro capacity)?

Yes, I thought about that; it would be pretty cool. But I would really want to land some Rust API first, and then we can easily have a VirtualFile trait. I don't think writable files would be that much harder. Listing directories looked trickier, though (making the virtual files appear in ls alongside real files).
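
For illustration only, one possible shape for such a trait. Nothing below exists in this PR: VirtualFile, StaticFile, GeneratedFile, and copy_range are hypothetical names, and a real design would also need to settle attributes, caching, and error reporting.

```rust
/// Hypothetical read-only virtual file interface.
trait VirtualFile {
    /// Current size of the file in bytes.
    fn len(&self) -> u64;
    /// Copy bytes starting at `offset` into `buf`; returns the number copied.
    fn read_at(&self, offset: u64, buf: &mut [u8]) -> usize;
}

/// Shared helper: copy as much of `data[offset..]` as fits into `buf`.
fn copy_range(data: &[u8], offset: u64, buf: &mut [u8]) -> usize {
    let start = usize::try_from(offset).unwrap_or(usize::MAX).min(data.len());
    let n = buf.len().min(data.len() - start);
    buf[..n].copy_from_slice(&data[start..start + n]);
    n
}

/// File backed by static data, like the entries in this PR.
struct StaticFile(&'static [u8]);

impl VirtualFile for StaticFile {
    fn len(&self) -> u64 {
        self.0.len() as u64
    }
    fn read_at(&self, offset: u64, buf: &mut [u8]) -> usize {
        copy_range(self.0, offset, buf)
    }
}

/// procfs-like file whose contents are produced by a closure on every call.
struct GeneratedFile<F: Fn() -> Vec<u8>>(F);

impl<F: Fn() -> Vec<u8>> VirtualFile for GeneratedFile<F> {
    fn len(&self) -> u64 {
        (self.0)().len() as u64
    }
    fn read_at(&self, offset: u64, buf: &mut [u8]) -> usize {
        copy_range(&(self.0)(), offset, buf)
    }
}
```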

@mtjhrc
Collaborator Author

mtjhrc commented May 13, 2026

@jakecorrenti

@mtjhrc do you want this to merge before #670 or after?

Actually, I changed my mind: we can merge this before #670, but if we do it after, it doesn't matter either way.

@jakecorrenti
Member

jakecorrenti commented May 13, 2026

@jakecorrenti

@mtjhrc do you want this to merge before #670 or after?

Actually, I changed my mind: we can merge this before #670, but if we do it after, it doesn't matter either way.

I already did an initial rebase on top of your commits so let's just do this one first.

@slp
Collaborator

slp commented May 13, 2026

/gemini review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a virtual inode overlay system (AugmentFs) for virtiofs, allowing the injection of synthetic, memory-backed files and directories into the guest filesystem. It includes a new NullFs implementation for virtual-only filesystems, an InodeAllocator for unique FUSE inode management, and updated API functions to support custom init binary injection and overlay file/directory creation. I have reviewed the changes and agree with the feedback regarding the unnecessary restriction on empty virtual files in krun_fs_add_overlay_file.

Comment thread src/libkrun/src/lib.rs
Comment on lines +2564 to +2566
    if c_fs_tag.is_null() || c_path.is_null() || data.is_null() || data_len == 0 {
        return -libc::EINVAL;
    }


medium

The check data_len == 0 prevents the creation of empty virtual files. This seems like an unnecessary restriction, as empty files are a valid use case (e.g., for lock files or placeholders). Consider removing this part of the condition to allow creating zero-length files. The data.is_null() check should be kept, as slice::from_raw_parts requires a non-null pointer even for zero-length slices.

    if c_fs_tag.is_null() || c_path.is_null() || data.is_null() {
        return -libc::EINVAL;
    }
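
A standalone sketch of the same validation, showing that a zero-length buffer is fine as long as the pointer is non-null. The helper name copy_overlay_data and the hard-coded EINVAL value are illustrative assumptions; the real function takes more arguments and uses libc::EINVAL.

```rust
use std::slice;

/// Linux errno value, hard-coded to keep the sketch dependency-free.
const EINVAL: i32 = 22;

/// Copy `data_len` bytes from a caller-provided pointer, accepting empty files.
///
/// # Safety
/// `data` must be valid for reads of `data_len` bytes.
unsafe fn copy_overlay_data(data: *const u8, data_len: usize) -> Result<Vec<u8>, i32> {
    // Reject null even when data_len == 0: from_raw_parts requires a
    // non-null, aligned pointer for zero-length slices too.
    if data.is_null() {
        return Err(-EINVAL);
    }
    // SAFETY: non-null was checked above; the caller guarantees that the
    // pointer is valid for `data_len` bytes.
    Ok(unsafe { slice::from_raw_parts(data, data_len) }.to_vec())
}
```

A caller that wants an empty virtual file can pass a pointer to any valid object, or a dangling but non-null pointer such as the one returned by as_ptr() on an empty array, together with a length of 0.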


4 participants